In this work, we study a computational way to handle the exploration-exploitation dilemma (or trade-off). In the previous exercise, we only modeled the prediction of the presence or absence of a reward; the "animal" was not coded to take actions, only to make predictions. The exploration-exploitation dilemma emerges when we try to turn the animal's predictions into a choice of action. To better understand this dilemma, let us describe a situation.
In this situation, a bee B in an environment E is searching for pollen as a reward $r$. In E there happen to be two kinds of flowers, blue flowers $b$ and yellow flowers $y$, with different rewards such that $r_b = 8$ and $r_y = 2$. At each step, B must collect pollen. To do so, it has to choose between two actions: go to $b$ and collect the reward $r_b$, or go to $y$ and collect the reward $r_y$. Now let us suppose that the bee learns from experience.
Here comes a problem: what rule should B apply in order to maximize its reward? The answer seems simple: B should try both flowers and then always go to the flower with the maximum reward, which is $b$. This solves the problem. But now let us imagine that E, being a realistic environment, changes over time. After a certain number of steps the season changes, and so do the rewards associated with $b$ and $y$. In summer, $r_b = 8$ and $r_y = 2$, but in autumn, a better season for yellow flowers, the rewards are inverted: $r_b = 2$ and $r_y = 8$. It is now clear that B must be able to adapt to changes in the environment. If B learned during summer that $b$ is the best option and decides to systematically collect $r_b$, then when autumn comes B will keep going to $b$ even though $y$ is now the better option. This is the exploration-exploitation dilemma: what rule should B apply in order to maximize its reward (exploitation) while still sampling the apparently less interesting option in case the rewards change (exploration)? In summer, collecting $r_b$ is the exploitation choice and collecting $r_y$ is the exploration choice, but in autumn the roles are reversed. How do we balance these two choices? We have to design an exploitation-exploration trade-off.
In [1]:
%matplotlib inline
%pylab inline
pylab.rcParams['figure.figsize'] = (10, 6)
This trade-off is described by a function that computes the probability of going to $b$ (the probability of going to $y$ being $p_y = 1-p_b$). This function is:
$$ p_b = \frac{1}{1+\exp(\beta(m_y-m_b))} $$ where $m_y$ and $m_b$ are the bee's estimates of the rewards of the yellow and blue flowers, and $\beta$ is the exploration-exploitation parameter, which scales the difference $m_y-m_b$.
In order to study this problem, we first write a simple program that plots the probability that B goes to the blue flowers for different values of $\beta$.
In [2]:
from matplotlib import pyplot as plt
import numpy as np
import math, random
pb = [] # List of probabilities of the action to go to blue flowers
betas = np.arange(0, 100, 1) # List of betas
for i in range(100):
    pbi = 1/(1+math.exp(betas[i]*0.5)) # Compute pb with a fixed difference my-mb = 0.5
    pb.append(pbi)
plt.plot(betas, pb, label='Probability to go to blue flowers pb')
plt.legend(bbox_to_anchor=(1, 1))
plt.title('Probability of the action to go to blue flowers depending on the value of beta')
plt.xlabel('beta')
plt.ylabel('Probability of occurrence')
plt.show()
We observe that the higher $\beta$ is, the closer $p_b$ gets to 0. This is expected: since the difference $m_y-m_b = 0.5$ is positive here, a larger $\beta$ makes the exponential, and thus the denominator, larger, so the fraction gets closer to 0. Now let us plot the evolution of $p_b$ with respect to a varying $m_y-m_b$.
In [3]:
pb = [] # List of probabilities of the action to go to blue flowers
mymb = np.arange(0, 100, 1) # List of my-mb
beta = 0.5 # Exploitation-exploration parameter
for i in range(100):
    pbi = 1/(1+math.exp(beta*mymb[i])) # Compute pb for the current value of my-mb
    pb.append(pbi)
plt.plot(mymb, pb, label='Probability to go to blue flowers pb')
plt.legend(bbox_to_anchor=(1, 1))
plt.title('Probability of the action to go to blue flowers depending on my-mb')
plt.xlabel('my-mb')
plt.ylabel('Probability of occurrence')
plt.show()
We observe a similar effect, for the same reason. However, here it is not the $\beta$ parameter that drives the change but the difference between the estimated rewards. This difference increases in this simulation, which means that the estimated reward $m_y$ progressively becomes much larger than the estimated reward $m_b$. This is why the probability of going to the blue flower decreases and tends to zero. The exploration-exploitation strategy defined by the function computing $p_b$ therefore behaves as expected: the probability of going to the blue flower decreases when the estimated reward $m_y$ exceeds $m_b$, that is, when $m_y-m_b$ is larger.
Now let us create a bee that is dumb in the sense that it has no learning ability: its choices depend only on the function computing $p_b$. What does $p_b$ look like if $\beta = 0$?
In [4]:
pb = [] # List of probabilities of the action to go to blue flowers
ychoices = [] # Choices to go to yellow flowers
bchoices = [] # Choices to go to blue flowers
time = np.arange(0, 200, 1) # Time axis
beta = 0 # Exploitation-exploration parameter
my = 5 # Estimated reward of yellow flowers
mb = 0 # Estimated reward of blue flowers
for i in range(200):
    pbi = 1/(1+math.exp(beta*(my-mb)))
    pb.append(pbi)
    rand = random.random()
    if rand < pbi:
        bchoices.append(1)
        ychoices.append(0)
    else:
        bchoices.append(0)
        ychoices.append(1)
plot = plt.subplot()
plot.plot(time, pb, label='Probability to go to blue flowers pb')
plot.plot(time, bchoices, '+b', label='Choice to go to blue flowers')
plot.plot(time, ychoices, 'xy', label='Choice to go to yellow flowers')
plt.title('Probability of the action to go to blue flowers depending on time (Dumb bee + 0 beta)')
plt.xlabel('Time')
plt.ylabel('Probability of occurrence')
plt.legend(bbox_to_anchor=(1, 0.8))
plt.ylim(-0.04, 1.04)
plt.show()
print("pb = " + str(pb[0]))
The probability that the bee goes to the blue flower is 0.5, and the probability that it goes to the yellow flower is also 0.5. Why? Because $\beta = 0$. With $\beta$ equal to zero we get a purely explorative behavior: the bee acts randomly and explores the blue flowers as often as the yellow flowers. Indeed, if $ \beta = 0 $, then $ \beta(m_y-m_b) = 0 $, so $ \exp(\beta(m_y-m_b)) = 1 $, and therefore: $$ p_b = \frac{1}{1+\exp(\beta(m_y-m_b))} = 0.5 $$
Now let us see what happens if $\beta > 0$:
In [5]:
pb = [] # List of probabilities of the action to go to blue flowers
ychoices = [] # Choices to go to yellow flowers
bchoices = [] # Choices to go to blue flowers
time = np.arange(0, 200, 1) # Time axis
beta = 0.8 # Exploitation-exploration parameter
my = 5 # Estimated reward of yellow flowers
mb = 0 # Estimated reward of blue flowers
for i in range(200):
    pbi = 1/(1+math.exp(beta*(my-mb)))
    pb.append(pbi)
    rand = random.random()
    if rand < pbi:
        bchoices.append(1)
        ychoices.append(0)
    else:
        bchoices.append(0)
        ychoices.append(1)
plot = plt.subplot()
plot.plot(time, pb, label='Probability to go to blue flowers pb')
plot.plot(time, bchoices, '+b', label='Choice to go to blue flowers')
plot.plot(time, ychoices, 'xy', label='Choice to go to yellow flowers')
plt.title('Probability of the action to go to blue flowers depending on time (Dumb bee + high beta)')
plt.xlabel('Time')
plt.ylabel('Probability of occurrence')
plt.legend(bbox_to_anchor=(1, 0.8))
plt.ylim(-0.04, 1.04)
plt.show()
print("pb = " + str(pb[0]))
If $\beta > 0$, the bee follows a more exploitative strategy. Even though it has no learning ability, it "understands" that, because its expectation of the reward of the yellow flower $m_y$ is larger than its expectation of the reward of the blue flower $m_b$, it is more interesting to go massively to the yellow flower. Since $\beta > 0$ and $m_y > m_b$, the term $\beta(m_y-m_b)$ is positive. This implies that $ 1 + \exp(\beta(m_y-m_b)) > 2$ and therefore: $$ p_b = \frac{1}{1+\exp(\beta(m_y-m_b))} < 0.5 $$
The exploitation-exploration parameter $\beta$ acts as a weight on the difference between the two estimated rewards. If $\beta = 0$, the bee is purely explorative; if $\beta \geq 1$, the bee is strongly exploitative; if $0 < \beta < 1$, the bee balances exploitation and exploration; and if $\beta < 0$, the bee prefers the less interesting flower, which means it acts worse than if it chose randomly.
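We can check these regimes numerically. The small helper below and the chosen values are ours, for illustration only; it uses the same estimates $m_y = 5$ and $m_b = 0$ as in the cells above.
import math

def softmax_pb(my, mb, beta):
    """Probability of choosing the blue flower under the choice rule above."""
    return 1/(1 + math.exp(beta*(my - mb)))

# my - mb = 5 > 0: yellow looks better, so exploitation should push pb below 0.5
for beta in (-0.5, 0, 0.5, 1, 5):
    print("beta = %5.1f  ->  pb = %.3f" % (beta, softmax_pb(5, 0, beta)))
# beta = 0 gives pb = 0.5 (pure exploration), a large beta drives pb towards 0,
# and a negative beta pushes pb above 0.5, favouring the worse flower.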
Now let us give the bee a learning ability in the form of the Rescorla-Wagner rule and see what happens. The expectations $m_y$ and $m_b$ now vary depending on the environment. Let us also consider that the bee has a natural bias: it thinks that yellow flowers bear a reward of 5 and that blue flowers bear no reward.
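For reference, the Rescorla-Wagner update used in the cell below adjusts only the estimate of the flower that was just visited, in the direction of the prediction error: $$ m \leftarrow m + \epsilon\,(r - m), $$ where $\epsilon$ is the learning rate and $r$ the reward actually collected ($m_b$ is updated after a visit to a blue flower, $m_y$ after a visit to a yellow flower).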
In [6]:
pb = [] # List of probabilities of the action to go to blue flowers
ychoices = [] # Choices to go to yellow flowers
bchoices = [] # Choices to go to blue flowers
time = np.arange(0, 200, 1) # Time axis
beta = 0.2 # Exploitation-exploration parameter
epsi = 0.2 # Learning rate parameter
my = 5 # Estimated reward of yellow flowers
mb = 0 # Estimated reward of blue flowers
for i in range(200):
    if i < 100:
        rb = 8
        ry = 2
    else:
        rb = 2
        ry = 8
    pbi = 1/(1+math.exp(beta*(my-mb)))
    pb.append(pbi)
    rand = random.random()
    if rand < pbi:
        bchoices.append(1)
        ychoices.append(0)
        mb = mb + epsi*(rb-mb)
    else:
        bchoices.append(0)
        ychoices.append(1)
        my = my + epsi*(ry-my)
plot = plt.subplot()
plot.plot(time, pb, label='Probability to go to blue flowers pb')
plot.plot(time, bchoices, '+b', label='Choice to go to blue flowers')
plot.plot(time, ychoices, 'xy', label='Choice to go to yellow flowers')
plt.title('Probability of the action to go to blue flowers depending on time (Smart bee + 0.2 beta)')
plt.xlabel('Time')
plt.ylabel('Probability of occurrence')
plt.legend(bbox_to_anchor=(1, 0.8))
plt.ylim(-0.04, 1.04)
plt.show()
What happens here? The bee starts with the assumption that yellow flowers are better, so, for the reasons exposed earlier, $p_b < 0.5$ at first. But this assumption is actually wrong: blue flowers bear a reward of 8 instead of 0, and yellow flowers bear a reward of 2 instead of 5. The bee now learns from experience using the Rescorla-Wagner rule. During the first day, it learns that blue flowers are more interesting, so $p_b$ increases because $m_y-m_b$ decreases. During the second day, the rewards borne by the two flowers are inverted, so $p_b$ decreases because $m_y-m_b$ increases: the bee learns that yellow flowers are now more interesting. This is possible because the bee still explores, thanks to $\beta < 1$. The probability $p_b$ saturates at a level determined by the value of $\beta$.
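To make this saturation level concrete (assuming the estimates converge to the true rewards of each day), with the value $\beta = 0.2$ used in the cell above: during the first day $m_b \to 8$ and $m_y \to 2$, so $$ p_b \to \frac{1}{1+\exp(0.2\,(2-8))} = \frac{1}{1+e^{-1.2}} \approx 0.77, $$ and during the second day the situation is symmetric, so $p_b$ saturates near $1 - 0.77 \approx 0.23$.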
This can be illustrated by running the same program with $\beta = 0$ and with $\beta = 1$.
In [7]:
pb = [] # List of probabilities of the action to go to blue flowers
ychoices = [] # Choices to go to yellow flowers
bchoices = [] # Choices to go to blue flowers
time = np.arange(0, 200, 1) # Time axis
beta = 0 # Exploitation-exploration parameter
epsi = 0.2 # Learning rate parameter
my = 5 # Estimated reward of yellow flowers
mb = 0 # Estimated reward of blue flowers
for i in range(200):
    if i < 100:
        rb = 8
        ry = 2
    else:
        rb = 2
        ry = 8
    pbi = 1/(1+math.exp(beta*(my-mb)))
    pb.append(pbi)
    rand = random.random()
    if rand < pbi:
        bchoices.append(1)
        ychoices.append(0)
        mb = mb + epsi*(rb-mb)
    else:
        bchoices.append(0)
        ychoices.append(1)
        my = my + epsi*(ry-my)
plot = plt.subplot()
plot.plot(time, pb, label='Probability to go to blue flowers pb')
plot.plot(time, bchoices, '+b', label='Choice to go to blue flowers')
plot.plot(time, ychoices, 'xy', label='Choice to go to yellow flowers')
plt.title('Probability of the action to go to blue flowers depending on time (Smart bee + 0 beta)')
plt.xlabel('Time')
plt.ylabel('Probability of occurrence')
plt.legend(bbox_to_anchor=(1, 0.8))
plt.ylim(-0.04, 1.04)
plt.show()
In [8]:
pb = [] # List of probabilities of the action to go to blue flowers
ychoices = [] # Choices to go to yellow flowers
bchoices = [] # Choices to go to blue flowers
time = np.arange(0, 200, 1) # Time axis
beta = 1 # Exploitation-exploration parameter
epsi = 0.2 # Learning rate parameter
my = 5 # Estimated reward of yellow flowers
mb = 0 # Estimated reward of blue flowers
for i in range(200):
    if i < 100:
        rb = 8
        ry = 2
    else:
        rb = 2
        ry = 8
    pbi = 1/(1+math.exp(beta*(my-mb)))
    pb.append(pbi)
    rand = random.random()
    if rand < pbi:
        bchoices.append(1)
        ychoices.append(0)
        mb = mb + epsi*(rb-mb)
    else:
        bchoices.append(0)
        ychoices.append(1)
        my = my + epsi*(ry-my)
plot = plt.subplot()
plot.plot(time, pb, label='Probability to go to blue flowers pb')
plot.plot(time, bchoices, '+b', label='Choice to go to blue flowers')
plot.plot(time, ychoices, 'xy', label='Choice to go to yellow flowers')
plt.title('Probability of the action to go to blue flowers depending on time (Smart bee + high beta)')
plt.xlabel('Time')
plt.ylabel('Probability of occurrence')
plt.legend(bbox_to_anchor=(1, 0.8))
plt.ylim(-0.04, 1.04)
plt.show()
We see empirically that, when the bee's assumptions are false, it learns until it reaches a level determined by $\beta$. With $\beta = 0$, the bee adopts a purely explorative strategy, as seen earlier. With $0 < \beta < 1$, the bee adopts a balanced strategy in which it exploits up to a maximum determined by $\beta$, since this value scales the difference between the two estimated rewards. With $\beta = 1$, the bee adopts a strongly exploitative behaviour: once the parameters of the environment are learned, it almost only goes to blue flowers. But how can the bee then change its behaviour when day two comes? One could suppose that, because the bee hardly ever goes to yellow flowers anymore, it cannot discover that $r_y$ is now higher than $r_b$. But the bee also perceives the drop of the reward $r_b$: when it updates its estimate of the reward of blue flowers, this estimate decreases, so $m_y-m_b$ increases, which causes $p_b$ to decrease too. Because $p_b$ decreases, yellow flowers are more likely to be explored, so the estimate $m_y$ is more likely to be updated and to increase, which makes $m_y-m_b$ increase further, and so on.
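To make this mechanism concrete, here is a small deterministic trace (our own illustration, not part of the simulation above): we start from estimates learned during day one ($m_b \approx 8$, $m_y \approx 2$) with $\beta = 1$, assume the bee keeps choosing blue for a few steps, and watch $p_b$ fall as $m_b$ is pulled down towards the new reward $r_b = 2$.
import math

# Hypothetical day-two situation with estimates inherited from day one
mb, my = 8.0, 2.0
beta, epsi = 1.0, 0.2
rb = 2                     # day-two reward of blue flowers (dropped from 8)

# Assume the bee keeps choosing blue for a few steps (the most likely outcome):
# each visit lowers mb, which raises my - mb and therefore lowers pb.
for step in range(6):
    pb = 1/(1 + math.exp(beta*(my - mb)))
    print("step %d: mb = %.2f, pb = %.3f" % (step, mb, pb))
    mb = mb + epsi*(rb - mb)   # Rescorla-Wagner update after collecting rb on blue
# pb falls from about 0.998 to about 0.88 in just a few blue visits,
# giving the yellow flower a growing chance of being explored.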
This strongly exploitative strategy becomes a problem when the current best action keeps giving the same reward while another action becomes preferable. The bee then has almost no way to discover that its action is no longer the best. Here is an example of this situation:
In [9]:
pb = [] # List of probabilities of the action to go to blue flowers
ychoices = [] # Choices to go to yellow flowers
bchoices = [] # Choices to go to blue flowers
time = np.arange(0, 200, 1) # Time axis
beta = 1 # Exploitation-exploration parameter
epsi = 0.2 # Learning rate parameter
my = 5 # Estimated reward of yellow flowers
mb = 0 # Estimated reward of blue flowers
for i in range(200):
    if i < 100:
        rb = 8
        ry = 2
    else:
        rb = 8
        ry = 10
    pbi = 1/(1+math.exp(beta*(my-mb)))
    pb.append(pbi)
    rand = random.random()
    if rand < pbi:
        bchoices.append(1)
        ychoices.append(0)
        mb = mb + epsi*(rb-mb)
    else:
        bchoices.append(0)
        ychoices.append(1)
        my = my + epsi*(ry-my)
plot = plt.subplot()
plot.plot(time, pb, label='Probability to go to blue flowers pb')
plot.plot(time, bchoices, '+b', label='Choice to go to blue flowers')
plot.plot(time, ychoices, 'xy', label='Choice to go to yellow flowers')
plt.title('Probability of the action to go to blue flowers depending on time (Smart bee + high beta)')
plt.xlabel('Time')
plt.ylabel('Probability of occurrence')
plt.legend(bbox_to_anchor=(1, 0.8))
plt.ylim(-0.04, 1.04)
plt.show()
Nevertheless, because $p_b$ never reaches 1, there will eventually be a moment where the bee makes an exploratory choice; this updates $m_y$, which diminishes $p_b$ and enables learning. To show this, let us give the bee more time.
In [10]:
pb = [] # List of probabilities of the action to go to blue flowers
ychoices = [] # Choices to go to yellow flowers
bchoices = [] # Choices to go to blue flowers
time = np.arange(0, 1000, 1) # Time axis
beta = 1 # Exploitation-exploration parameter
epsi = 0.2 # Learning rate parameter
my = 5 # Estimated reward of yellow flowers
mb = 0 # Estimated reward of blue flowers
for i in range(1000):
    if i < 100:
        rb = 8
        ry = 2
    else:
        rb = 8
        ry = 10
    pbi = 1/(1+math.exp(beta*(my-mb)))
    pb.append(pbi)
    rand = random.random()
    if rand < pbi:
        bchoices.append(1)
        ychoices.append(0)
        mb = mb + epsi*(rb-mb)
    else:
        bchoices.append(0)
        ychoices.append(1)
        my = my + epsi*(ry-my)
plot = plt.subplot()
plot.plot(time, pb, label='Probability to go to blue flowers pb')
plot.plot(time, bchoices, '+b', label='Choice to go to blue flowers')
plot.plot(time, ychoices, 'xy', label='Choice to go to yellow flowers')
plt.title('Probability of the action to go to blue flowers depending on time (Smart bee + high beta)')
plt.xlabel('Time')
plt.ylabel('Probability of occurrence')
plt.legend(bbox_to_anchor=(1, 0.8))
plt.ylim(-0.04, 1.04)
plt.show()
(Note that even with 1000 steps, it is statistically possible that the bee never tries the yellow flower in this time, so the previous graph may not show this effect.)
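As a rough back-of-the-envelope check (our own estimate, assuming the bee's estimates have settled near $m_b \approx 8$ and $m_y \approx 2$ by the end of day one), we can compute the per-step probability of an exploratory yellow visit and the chance of never exploring during the remaining 900 steps:
import math

# Assumed converged estimates at the end of day one (values from the scenario above)
beta, mb, my, remaining_steps = 1.0, 8.0, 2.0, 900
p_yellow = 1 - 1/(1 + math.exp(beta*(my - mb)))   # per-step exploration probability, ~0.0025
p_never = (1 - p_yellow)**remaining_steps         # chance of never trying yellow, ~0.11
print("p_yellow per step = %.4f" % p_yellow)
print("P(no exploration in %d steps) = %.2f" % (remaining_steps, p_never))
So roughly one run in ten would never explore the yellow flower in this duration, which is why the graph above may occasionally fail to show the switch.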